Abstract



Systems for recording address traces of operating system activity have fre- 
quently relied on special-purpose hardware and microcode modifications for 
data collection [1, 2, 10, 11, 30, 32]. In the last decade, changes in computer 
systems design have made the implementation of such hardware and 
microcode-based tracing systems impractical. This paper documents the 
evolution of a group of software methods to collect system traces. The tools 
require no special-purpose hardware and no hardware modifications. We 
have applied these tools to three substantially different operating systems 
and two processor architectures. This paper describes the instrumentation 
techniques, the means used to assure the quality of the collected data, and 
our evaluation of correctness and accuracy of traces. Our experience shows 
that software methods can yield trace of very good quality, and can be used 
to measure complex software systems. 






1. Introduction 

Address tracing is an important technique for measuring the dynamic behavior of computer 
software systems and the interactions between software and hardware on a computer system. 
The two approaches to collecting address trace data (which we will sometimes call simply trace) 
are hardware methods and software instrumentation. 

Hardware methods involve the use of modified microcode or a special-purpose device to inter- 
cept and record memory references as they occur. With software methods, the program or 
programs of interest are augmented with instrumentation code, such that address trace data is 
generated as a side-effect of program execution. This suggests three problems with software 
methods: 

• The instrumentation process is intrusive and can change the behavior of the traced 
system in a substantial way. 

• Trace from different address-spaces tends to be buffered independently and hence 
partitioned on a per-address-space basis. 

• Operating system code is difficult to instrument for address tracing. 

Hardware methods avoid the problems of software instrumentation by intercepting trace information
at a very low level, such that all activity is captured regardless of its source, and the
behavior of software is unaffected. Unfortunately, several properties of current processor design
make hardware-based tracing difficult:

• Microcode-based designs are not useful because current processors do not use 
reloadable microcode. 

• Current processors tend to incorporate memory system components such as caches
and translation buffers into an integrated microprocessor package, such that the
required address trace information is transferred on submicron-sized structures sealed
inside the computer chip. This makes them impractical to access with a hardware
monitor.

• Modern high-performance computers operate at very high clock speeds. Con- 
sequently, any monitoring device must also operate at a very high speed. This 
makes such a device difficult and expensive to build. 

The problems with hardware tracing led us to take a harder look at software-based methods. In 
this paper we describe and discuss the three tracing systems we have implemented. All are 
variants of the Unix operating system. 

• Traced Tunix: Tunix was derived from DEC Ultrix, version 4.1, and ran on the
DECWRL Titan [5].

• Traced Ultrix: A traced version of Ultrix version 4.2 for the DECstation 5000/200. 

• Traced Mach 3.0: A traced version of Carnegie Mellon's Mach 3.0 microkernel
(MK78) and UNIX server (UX39) for the DECstation 5000/200.

The remainder of the paper is structured as follows. Section 2 discusses previous work. Section 3
discusses the design of the tracing systems, beginning with instrumentation tools and fundamental
notions of system design, then going on to describe details of the Tunix, Ultrix, and Mach 3.0
tracing systems and how they differ. Section 4 discusses sources of distortion in software-based
tracing systems and how we avoided them. Section 5 discusses measurements we made to establish
that the trace data was a reasonable indicator of real system behavior. In the last section we
briefly summarize our conclusions.

2. Previous Work 

Software methods have been applied extensively to study user-only traces, yielding results in 
cache behavior [15, 16, 17, 26, 28], prefetching [6], the importance of long traces [5], the impact 
of context switches [20], and studies of TLB and page behavior [9, 18, 29]. These user-only 
studies are useful but limited, as system activity can have a large impact on overall performance 
[2, 12, 30]. More recent work documenting significant performance problems for system execu- 
tion on RISC-based computer systems [3, 24] suggests that system behavior needs more atten- 
tion in performance studies and hardware design. 

Clark and Emer were among the first to emphasize the importance of system activity when 
modeling memory system behavior. They used direct measurements of hardware to study cache 
performance in the VAX 11/780 [10] and to evaluate the VAX 11/780 translation buffer [12]. 
They point out that direct measurement and simulation have complementary advantages and dis- 
advantages. They also recognized the problem of distortion of system behavior due to tracing. 

A review of more recent hardware tracing projects demonstrates how the obstacles of 
hardware methods limit their applicability in current research. 

In the ATUM system, the microcode of a DEC VAX 8200 was modified to record an address 
trace of system and user execution [2]. The method was applied for both VMS and Ultrix sys- 
tems to test a variety of cache configurations. ATUM has several drawbacks. The foremost is 
that it can only be used on microcoded processors. Also, long contiguous trace was not possible. 
The ATUM designers proposed trace stitching to address this limitation. 

Two interesting projects used hardware monitors to collect system traces of multiprocessor 
workloads. Researchers at Carnegie Mellon University traced an Encore Multimax computer 
[32]. Researchers at Stanford traced cache misses only [30]. Cache misses are less frequent 
than memory references, and this eases the requirements of the hardware device. 

Two recent hardware projects used address traces of uniprocessors. The Monster system 
[21] uses a logic analyzer to capture signals from the CPU chip of a DECstation 3100. They 
have applied their system to study TLB behavior [22] and the system issues for memory system 
design [23]. The BACH system has been used to collect address trace on both Intel 80486 and 
Motorola 68030 based machines [14]. To date they have not demonstrated the applicability of 
their system to RISC based computers. 

In our work we used two instrumentation tools, Mahler [34] and epoxie [35]. There are 
numerous other software instrumentation tools [4, 13, 27], but none to our knowledge have been 
used to collect address traces of operating system behavior.

3. The WRL-CMU Tracing Systems



3.1. High-Level Design 

The design of the tracing systems was motivated by the need for accurate simulations of the 
large memory systems that are required by state-of-the-art processors. This had the following 
implications: 

• The traces must be complete. They must represent the kernel and multiple users as
they execute on a real machine. The memory references must be interleaved as they
are during execution, rather than being an artificial interleaving of separate traces.

• Traces must be accurate. The mechanism used must not distort execution to the 
extent that the behavior of the system is no longer realistic. 

• Traces must be flexible. It must be possible to pick and choose the processes to be 
traced, optionally trace kernel execution, and turn tracing on and off at any time. 

• The traces must be long enough to make possible the realistic simulation of very 
large caches. Since traces of the required length can outstrip storage capacity, trace 
analysis that must be done off-line against stored traces is unacceptable. 

Figure 1 shows a high-level diagram of the tracing system. The system involves three kinds of 
entities: traced user processes, the traced kernel, and an analysis program which consumes the 
trace. The kernel controls the tracing system. Appropriate mechanisms are used to avoid tracing 
kernel activity that occurs on behalf of the tracing system. 




Figure 1: Overview of the tracing system. 

At any instant during a tracing experiment the system is operating in one of two modes: trace- 
generation or trace-analysis. During trace-generation, trace from user-processes goes first into a 
per-process buffer. When that buffer becomes full, a kernel trap occurs and the per-process trace 
is copied into the large in-kernel buffer. When the in-kernel buffer becomes full, the system 
switches from trace-generation to trace-analysis, during which an analysis program (such as a 
memory system simulator) digests the trace. Analysis continues until all pending trace has been 
analyzed and the in-kernel buffer is empty.

In addition to copying trace from per-process buffers when they become full, available trace is
copied into the kernel each time the kernel is activated. As the kernel is invoked between every 
change of context, the interleaving of trace from all the various sources is preserved. 
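
The two-level buffering described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the kernel code: the class name, method names, and buffer sizes are hypothetical, and the model ignores kernel-generated trace. It shows why draining the per-process buffer on every kernel activation preserves the interleaving of trace from different processes.

```python
from dataclasses import dataclass, field


@dataclass
class TraceSystem:
    """Toy model: per-process buffers feeding one in-kernel buffer."""
    user_capacity: int            # words per per-process buffer (hypothetical)
    kernel_capacity: int          # words in the in-kernel buffer (hypothetical)
    user_bufs: dict = field(default_factory=dict)
    kernel_buf: list = field(default_factory=list)
    analyzed: list = field(default_factory=list)

    def user_ref(self, pid, word):
        """A traced user process appends one word of trace."""
        buf = self.user_bufs.setdefault(pid, [])
        buf.append(word)
        if len(buf) >= self.user_capacity:
            self.kernel_entry(pid)    # full buffer: trap into the kernel

    def kernel_entry(self, pid):
        """Any kernel activation drains the interrupted process's buffer."""
        for word in self.user_bufs.pop(pid, []):
            self.kernel_buf.append((pid, word))
            if len(self.kernel_buf) >= self.kernel_capacity:
                self.analyze()        # switch to trace-analysis mode

    def analyze(self):
        """Trace-analysis mode: consume all pending trace, then resume."""
        self.analyzed.extend(self.kernel_buf)
        self.kernel_buf.clear()
```

Because the kernel is entered between any two processes' execution, the in-kernel buffer in this model always holds trace in the order in which the processes actually ran.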

A certain number of kernel modifications were required to support user tracing, independent 
of the generation of kernel trace. An in-kernel trace buffer was set up. This is allocated stati- 
cally at boot time and is never seen by the kernel memory management subsystem. Exception 
handlers were modified to copy trace from per-process buffers into the in-kernel buffer whenever 
traced user processes are interrupted. Further, a mechanism is provided for the analysis program 
to extract trace from the in-kernel buffer. In Tunix and Ultrix, a memory special file is used 
(similar to /dev/kmem). In Mach 3.0, the in-kernel buffer is mapped into the virtual address 
space of the analysis program. 

A kernel call was added to both systems, to provide a mechanism for user-level analysis
programs to control tracing. Process creation was modified to initialize tracing data structures.
Scheduler modifications were used to ensure that traced processes are inactive during trace
analysis.



3.2. Software Instrumentation 

This section describes the software instrumentation performed by the CMU version of a tool 
called epoxie [35], which was used for the traced Ultrix and Mach 3.0 systems on the DECsta- 
tion. The tools used for the traced Tunix system differ in several ways; in particular they were 
integrated into the compiler/loader system, but the overall approach is similar to that used by 
epoxie. 

Epoxie is similar in spirit to the pixie tool from MIPS Computer Systems [27], which can be 
used to insert address-tracing code into an executable. Since the instrumentation code causes the
program text to expand considerably, addresses of procedures and branch targets change; address 
references must be adjusted in the instrumented version if the program is to run correctly. Pixie 
does some of this address correction statically, when the original executable is rewritten as an 
instrumented executable, but it must do part of it dynamically, by including a complete address 
translation table in the instrumented executable and doing lookups in this table during execution 
of the instrumented program. 

Epoxie differs from pixie in that it rewrites object files at link time. Modifying object code at 
link time is easier than modifying an executable, because the symbol and relocation tables 
present in object code allow epoxie to distinguish unambiguously between uses of addresses and 
uses of coincidentally similar constants. This information also allows all address correction to be 
done statically, incurring no runtime overhead. The addition of address-tracing code results in a 
significant increase in text segment size. For this project, extensive modifications were made to 
epoxie to minimize text expansion. The text growth factor ranges between 1.9 and 2.3*. This
compares very favorably with pixie, QPT [4], and the original epoxie, all of which expand the
text by a factor of 4-6 when used for address tracing**. It should be noted that minimal text
growth was not a design objective for any of the earlier tools.

* Actual growth depends on the length of basic blocks and the density of memory instructions.
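
The growth factors for the gcc example quoted in the footnote on text growth can be checked with simple arithmetic. The byte counts below are the ones given in that footnote; the dictionary keys are just labels chosen for this illustration.

```python
ORIGINAL_TEXT = 688128   # bytes of gcc text before instrumentation

instrumented = {
    "pixie -t": 4131968,
    "original epoxie -t": 3780608,
    "modified epoxie": 1515520,
}

# Growth factor = instrumented text size / original text size.
growth = {tool: size / ORIGINAL_TEXT for tool, size in instrumented.items()}

for tool, factor in growth.items():
    print(f"{tool}: {factor:.2f}x")
```

The ratios come out at roughly 6.0 for pixie, 5.5 for the original epoxie, and 2.2 for the modified epoxie, consistent with the ranges stated in the text.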

Note that the expansion of traced text does not affect the trace addresses generated, as the
addresses seen by the simulator correspond to the uninstrumented binary. The motivation for
minimizing text expansion was to limit the additional I/O and VM activity that occurs as a result
of text growth. I/O and VM effects are discussed in greater detail in Section 5.

A last crucial difference between epoxie and other instrumentation tools is that the other tools 
work only for single application programs. Epoxie has the flexibility to be used in a tracing 
system with multi-process workloads, threaded tasks in Mach 3.0, and operating system kernels. 

Epoxie inserts trace-collecting code at the beginning of each basic block and before every 
memory instruction of the original program text. Figure 2 shows an example of a code sequence 
before and after instrumentation. 





     fopen:
     i + 0      addiu   sp, sp, -24
     i + 1      sw      ra, 20(sp)
     i + 2      sw      a0, 24(sp)
     i + 3      jal     _findiop
     i + 4      sw      a1, 28(sp)

     a) Before Instrumentation

     fopen:
     i' + 0     sw      ra, 124(xreg3)
     i' + 1     jal     bbtrace
     i' + 2     li      zero, 4
     i' + 3     addiu   sp, sp, -24
     i' + 4     jal     memtrace
     i' + 5     addiu   zero, sp, 20
     i' + 6     sw      ra, 20(sp)
     i' + 7     jal     memtrace
     i' + 8     sw      a0, 24(sp)
     i' + 9     jal     memtrace
     i' + 10    sw      a1, 28(sp)
     i' + 11    jal     _findiop
     i' + 12    nop

     b) After Instrumentation

Figure 2: Instrumentation by epoxie



Each basic block is preceded by a three-instruction sequence, as in instructions i' + 0 .. i' + 2.
The jump instruction at i' + 1 is a call to a basic block trace routine bbtrace that will store the
jal's return address into the trace buffer. During trace analysis, the trace parsing library will
use static information about the binary image to map this address to the correct basic block
address in the original (uninstrumented) binary.

The jal instruction destroys the return address register ra, so instruction i' + 0 saves ra in
the trace bookkeeping area before bbtrace is called; bbtrace and memtrace restore the
contents of ra before they return. The delay slot of the jal bbtrace contains a special no-op
(instruction i' + 2), a load-immediate to the read-only register zero, with the number of words
of trace generated by the basic block in the immediate field. This will be used by bbtrace to
determine if there is enough room in the user trace buffer for trace from this basic block to be
stored.
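
The role of the word count in the delay slot can be illustrated with a small Python model. The real bbtrace is a MIPS assembly routine; the class and function names here, the buffer capacity, and the shape of the address map are hypothetical and chosen only to show the room check and the later mapping back to the uninstrumented binary.

```python
class UserTraceBuffer:
    """Toy per-process trace buffer with a fixed capacity in words."""

    def __init__(self, capacity_words):
        self.capacity = capacity_words
        self.words = []
        self.flushes = 0

    def flush_to_kernel(self):
        # Stand-in for the kernel trap that copies the buffer out.
        self.flushes += 1
        self.words.clear()


def bbtrace(buf, return_address, words_needed):
    """Model of the basic block trace routine.

    words_needed plays the role of the immediate in the `li zero, N`
    delay-slot instruction: the number of trace words the block emits,
    including the basic block record itself.
    """
    if len(buf.words) + words_needed > buf.capacity:
        buf.flush_to_kernel()          # make room before the block runs
    buf.words.append(return_address)   # the jal's return address is the record


def original_block_address(return_address, block_map):
    """Map an instrumented return address to the original block address.

    block_map is a hypothetical table built from static information
    about the binary image.
    """
    return block_map[return_address]
```

Checking once per basic block, rather than once per memory reference, means that the memory trace routine in this model never has to test for a full buffer.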



** For a gcc binary with 688128 bytes of text, pixie -t gcc grows program text to 4131968 bytes. Epoxie
-t gcc grows text to 3780608 bytes. QPT expands gcc text by a factor of 5.5 [19]. The modified epoxie grows
text to 1515520 bytes.
The tracing system requires three registers for its own use, referred to symbolically as xreg1,
xreg2, and xreg3. Uses in the original binary of these stolen registers are replaced with
sequences of instructions that use a "shadow" value for the register, in memory.

Memory instructions are typically expanded into a two-instruction sequence, a jal
memtrace with the memory instruction in the delay slot, as with i + 2 from Figure 2.
Memtrace partially decodes the instruction in the branch delay slot to compute the address of
the memory reference. Certain hazard conditions sometimes make it impossible to put the
actual memory instruction in the branch delay slot. An example is instruction i + 1 in the
example above, which reads the ra register. For such cases, a no-op with the same base register
and offset as the memory instruction is used in the delay slot, and the real memory instruction is
issued after the call to memtrace.
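
The partial decode performed by memtrace can be sketched as follows. This is a Python model rather than the actual assembly routine; it relies only on the standard MIPS I-type layout, in which the base register occupies bits 25..21 and the signed 16-bit offset occupies bits 15..0. The function names are hypothetical.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def effective_address(instruction_word, registers):
    """Compute base + offset for the instruction in the delay slot.

    Only the base register and offset fields are examined, so the same
    decode works for a real load or store and for the substitute no-op
    that carries the same base register and offset.
    """
    base = (instruction_word >> 21) & 0x1F
    offset = sign_extend16(instruction_word & 0xFFFF)
    return (registers[base] + offset) & 0xFFFFFFFF


SP = 29                          # register number of sp
SW_A0_24_SP = 0xAFA40018         # sw    a0, 24(sp)
ADDIU_ZERO_SP_20 = 0x27A00014    # addiu zero, sp, 20  (substitute no-op)
```

With sp holding some value S, the first encoding yields S + 24 and the second yields S + 20, which is the address later written by the real `sw ra, 20(sp)`.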

For a given input, a traced program executes many more instructions than the original binary, 
so execution time is longer. In Section 4 we discuss sources of distortion in the traced system 
and how we have controlled them. 

3.3. Tracing the Kernel 

Features to facilitate complete system tracing are important in our design. An example is the 
trace format. In the Ultrix and Mach 3.0 systems, a trace entry for a basic block or memory 
reference is a single machine word. This means that a single machine instruction records a com- 
plete trace entry. In this way, trace entries remain contiguous, with no locks or other protection 
mechanisms required. Another feature that helps accommodate system tracing is that control of 
the tracing system resides in the kernel. This centralized control makes it possible to preserve 
the interleaving of trace from various sources. In systems like Pixie and QPT where trace is 
managed at user level, preserving this interleaving is difficult. 

Several peculiarities of operating system kernels make instrumentation a substantially dif- 
ferent problem from instrumenting an application program. The foremost is the presence of 
uninstrumented code. Certain parts of the kernel are not rewritten by the instrumentation tool, 
either because they are part of the tracing system and should not be traced, or because they are 
too delicate to be rewritten mechanically. Uninstrumented code in the traced kernel must be 
carefully handled so as to preserve and maintain the state of the tracing system. 

A second problem with tracing the kernel is the need to manage the tracing system. Traced 
applications are serviced by the trace-control subsystem when their trace buffers become full, 
with per-process user trace copied into the large in-kernel buffer. Trace of operating system 
activity goes directly into the in-kernel buffer, so the in-kernel buffer can become full at an ar- 
bitrary point during system activity. However, servicing the full buffer is a complicated opera- 
tion, and cannot be scheduled arbitrarily. Provisions must be made for critical system operations 
to complete before tracing is suspended. 

A third problem is the concurrency introduced by interrupts and exceptions. With traced user
processes, trace from concurrent user-level activities is always separated by an invocation
of the kernel. This provides an occasion for the kernel to maintain trace system state. There is
no such opportunity for a traced kernel, as no intermediate party is available to maintain the
kernel's tracing state when the kernel itself is interrupted by an exception. To address this
problem, the exception handling mechanism in the kernel must be modified to correctly handle
trace state, and the trace-analysis system must correctly handle situations when arbitrary kernel
activity is interrupted by an exception.

All relevant parts of the kernel are traced. Routines too delicate to be instrumented by epoxie 
were instrumented by hand. Certain code, executed only at boot time or after an unrecoverable 
system error, is not instrumented. 

3.4. Tunix on the Titan 

Our first tracing system was implemented for the Titan [5], an early experimental RISC 
workstation with a 45ns cycle time and different register sets for user and kernel. The Titan ran 
a modified version of Ultrix called Tunix. Process and memory management functionality were 
rewritten for Tunix. 

All Titan compilers used a common intermediate language, Mahler, which defined a Mahler 
abstract machine. The Mahler implementation [33, 34] consists of a translator and an extended 
linker. Object modules produced by Mahler contain sufficient supplementary information to 
support the code modification required for address trace generation. In particular, basic blocks 
and their sizes are identifiable at link time. The linker augmented code to be traced with
instrumentation code so that traced programs record the addresses and lengths of basic blocks and
the target addresses of loads and stores when executed.

A single large trace buffer was managed by the operating system and mapped into every ad- 
dress space. In traced workloads, the compiler reserved five of the 64 user registers for use by 
the tracing system. Since all tracing data was sent to the same buffer, trace data was correctly 
interleaved for multiple processes and the kernel. 

Only specially linked programs were traced. Kernel tracing was turned off when the kernel 
acted on behalf of an untraced program. In particular, it was turned off while the trace analysis 
program was running. 

Tunix kernel tracing worked well enough to demonstrate that software-based kernel tracing
was possible. Preliminary cache simulation experiments showed that kernel cycles per instruc- 
tion (CPI) were three times user CPI, and had a significant effect on overall CPI. 

Unfortunately, there were a number of problems. As kernel address references were to physi- 
cal addresses and user references were to virtual addresses, it was difficult to determine when a 
user and a kernel address actually referred to the same memory location. Also, portions of the 
Tunix kernel, such as the software TLB miss handler, were not traced. Finally, the internals of 
Tunix were sufficiently different from commercial operating systems that we were reluctant to 
draw general conclusions from the behavior of Tunix. The traced Tunix system established the 
potential of software-based system tracing, and gave us the necessary experience to move our 
techniques to a more mainstream operating system. The Tunix tracing system also produced a 
collection of single and multi-task user-level traces on tape, which were made available to the 
community for use in memory system research.

3.5. Ultrix on the DECstation 5000/200

Epoxie was used to instrument both Ultrix and Mach 3.0. With the Titan machine architecture 
and with the instrumentation tools integrated with the compiler system, it was straightforward in 
Tunix to reserve registers for address tracing. In contrast, epoxie operates on binaries after com- 
pilation, so registers reserved for tracing had to be "stolen." The necessity of register-stealing 
complicated the implementation of the tracing system, creating additional trace-system state to 
be maintained and additional invariants to be observed. 

Another source of added complexity in the DECstation tracing systems was the handling of 
nested interrupts. The Titan had only one interrupt level, so nested interrupts were impossible. 
The nested interrupts on the DECstation require the tracing system to use a stack to maintain its 
state during multiple nested system invocations. 

An interesting change between traced Tunix and traced Ultrix is the handling of basic block 
records. Mahler and Epoxie both generate static information describing each basic block (num- 
ber of instructions, position of loads and stores). This information is used when the trace is 
analyzed, to determine the correct interleaving of instruction and data memory references. In 
traced Tunix, the basic block records were written into the trace along with the traced addresses. 
In the Ultrix system, only the basic block address is written. A lookup table is used in the trace 
parsing library to find static information for a given basic block address. One advantage of this 
technique is it makes the trace more concise, so the trace takes less space and less time to write. 
Another advantage is that the basic block lookup creates an opportunity for implementing special 
behaviors for a specific basic block address. An example is hand-traced code. The trace-parsing 
system can recognize the basic block record of a hand-traced routine as special, and respond 
accordingly. Another example is instruction counting, with flags in basic block records to start 
and stop counters. An example application of these counters is measuring activity in the idle- 
loop. 
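
The lookup-based parsing described above might look roughly like the following Python sketch. The record layout (`BlockInfo`), the function name, and the example addresses are hypothetical; the actual trace-parsing library is not shown in this paper. The sketch assumes four-byte instructions and a trace in which each basic block address is followed by one word per load or store in that block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockInfo:
    """Static information for one basic block (hypothetical layout)."""
    n_instructions: int
    mem_ops: tuple    # (instruction index, "load" or "store") pairs, in order


def expand_trace(trace_words, block_table):
    """Yield (kind, address) references in execution order.

    Each basic block address in the trace is looked up in block_table;
    the static record says how many instruction fetches to emit and at
    which instructions the following data addresses belong.
    """
    words = iter(trace_words)
    for block_addr in words:
        info = block_table[block_addr]
        mem_at = dict(info.mem_ops)
        for i in range(info.n_instructions):
            yield ("ifetch", block_addr + 4 * i)
            if i in mem_at:
                yield (mem_at[i], next(words))
```

For the five-instruction block of Figure 2, with stores at instructions 1, 2, and 4, four trace words expand into eight references: five instruction fetches interleaved with three stores.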

3.6. Mach 3.0 on the DECstation 5000/200 

Mach 3.0 is a microkernel that implements and exports a small number of low-level system 
services, with higher-level services implemented in a user-level UNIX server. The Mach 3.0 
virtual memory interface [25] permitted a number of improvements in the implementation of the 
tracing system. In the analysis program, trace was extracted from the kernel by mapping the 
in-kernel buffer into the analysis program's address space, eliminating copying and buffering of
trace data. 

Another use of Mach 3.0 virtual memory primitives is dynamic allocation of the per-process 
trace pages. In the Ultrix system, a flag was set in the executable image to indicate that a process 
was traced. This flag is checked when a traced program is started. Traced programs get per- 
process trace pages, and are scheduled according to the state of the tracing system. The Mach 
3.0 system identifies traced programs by detecting references to the per-process trace pages. 
This feature in Mach 3.0 is particularly important for the implementation of multiple traced 
threads in a single address space, as independent trace pages are allocated for each thread. 
Context-switching code in the kernel maps the correct per-thread pages when a new thread is
activated.

4. Maintaining Trace Quality

Instrumenting the system involves substantial modifications to every instruction of active sys- 
tem code. As such, one of the original goals in the design of the tracing system was to control 
the impact of instrumentation on system behavior. 

4.1. Avoiding Trace Distortion 

Tracing with software methods induces two kinds of distortion on system behavior, memory 
dilation and time dilation. 

Memory Dilation 

A program instrumented with epoxie is about a factor of two larger than its untraced counter- 
part. This can affect paging and TLB miss behavior. We avoid perturbations due to paging 
behavior by collecting our traces on a machine with a large physical memory, such that pageouts 
do not occur. We feel this is a reasonable simplification, since the main utility of our tools is in 
analyzing memory system performance, and most aspects of memory system performance be- 
come irrelevant when significant paging activity is present. 

TLB behavior is slightly more subtle. The DECstation address space is divided into four
segments, two mapped and two unmapped. All kernel text and most kernel data is referenced
through the unmapped segments; hence these references do not affect the TLB. The two mapped
segments do require translations from the TLB, and each handles TLB misses differently. A
miss to the user segment is called a UTLB miss and is handled in software via a dedicated
exception vector and a nine-instruction miss handler routine. A miss to the mapped kernel segment
is called a KTLB miss. KTLB misses are handled through the general exception mechanism, which is
much slower (several hundred instructions). Fortunately, KTLB misses are much rarer.

Instrumentation causes the number of user text pages to grow by a factor of two. With twice 
as many text pages, UTLB miss behavior can differ substantially between traced and untraced 
workloads. Because of the different behavior, trace from the actual user TLB miss handler 
would not be representative of the untraced system. Rather than tracing the UTLB miss handler, 
we simulate the TLB, and use misses in the simulator to synthesize the activity of the UTLB 
miss handler. 
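
A minimal sketch of synthesizing handler activity from a simulated TLB is given below. Several details are assumptions made for illustration and are not taken from this paper: the page size, the entry count, the random replacement policy, and the handler address. The sketch also ignores address space identifiers and the handler's own page table reference.

```python
import random

PAGE_SHIFT = 12                   # 4 KB pages (assumption)
UTLB_HANDLER_ADDR = 0x80000000    # assumed location of the miss handler
UTLB_HANDLER_LEN = 9              # nine-instruction handler, as in the text


class SimulatedTLB:
    """Toy fully associative TLB with random replacement (assumed policy)."""

    def __init__(self, n_entries=64, seed=0):
        self.n_entries = n_entries
        self.entries = []                 # resident virtual page numbers
        self.rng = random.Random(seed)
        self.misses = 0

    def access(self, vaddr):
        """Return synthesized handler instruction fetches (empty on a hit)."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.entries:
            return []
        self.misses += 1
        if len(self.entries) < self.n_entries:
            self.entries.append(vpn)
        else:
            self.entries[self.rng.randrange(self.n_entries)] = vpn
        return [UTLB_HANDLER_ADDR + 4 * i for i in range(UTLB_HANDLER_LEN)]
```

Feeding the simulator addresses from the uninstrumented binary, as the trace provides, means that the synthesized misses reflect the text footprint of the original program rather than that of the enlarged traced one.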

Mapped kernel memory is used primarily to map page table pages. If instrumentation changed 
the number of page table pages required to map user text, then KTLB miss behavior could be 
affected. Fortunately, each page table page can map 4 megabytes of contiguous memory. As the 
largest traced binary has less than 2 megabytes of text, the number of page table pages it requires 
does not change, and the behavior of the KTLB miss handler is unaffected. 
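
The 4-megabyte figure follows from simple arithmetic, worked through below. The page size and page table entry size are assumptions chosen to be consistent with that figure, and the text start address in the usage note is hypothetical.

```python
PAGE_SIZE = 4096    # bytes per page (assumption)
PTE_SIZE = 4        # bytes per page table entry (assumption)

# One page of PTEs maps (PAGE_SIZE / PTE_SIZE) pages of PAGE_SIZE bytes each.
BYTES_PER_PT_PAGE = (PAGE_SIZE // PTE_SIZE) * PAGE_SIZE


def page_table_pages(start, length):
    """Number of page table pages needed to map [start, start + length)."""
    if length == 0:
        return 0
    first = start // BYTES_PER_PT_PAGE
    last = (start + length - 1) // BYTES_PER_PT_PAGE
    return last - first + 1
```

If text begins on a 4-megabyte boundary, for example at a hypothetical address of 0x400000, then 1 megabyte of text and 2 megabytes of text each need a single page table page, so doubling the text size leaves the count unchanged.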

Time Dilation 

The instructions added by software instrumentation cause traced programs to execute about 
fifteen times more slowly than their untraced counterparts. Temporal relationships for activity 
that depends on the speed of CPU instruction execution are unaffected, as the slowdown for all 
instrumented code is roughly the same. Time dilation occurs because activities independent of 
CPU speed appear to occur about fifteen times faster for the traced system. For the workloads 
we have considered, this affects clock interrupts and the latency of I/O operations. Adjusting for 
clock interrupts was straightforward: we configured the system clock to interrupt at 1/15th the
standard rate. 

We have not modified I/O behavior to account for time dilation, as this would require subtle 
system changes that might themselves introduce other distortions. Instead, we estimate I/O 
delays using a count of the number of instructions executed while waiting in the system idle- 
loop. We estimate I/O delays in the untraced system by multiplying idle time in the traced sys- 
tem by a scaling factor of fifteen. The approximation this gives is very rough, but largely ade- 
quate for our purposes, as I/O delays are of little interest in memory system behavior. 
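
The two adjustments for time dilation amount to scaling by the slowdown factor, as the short sketch below shows. The factor of fifteen is taken from the text; the function names and the clock rate in the usage note are hypothetical.

```python
SLOWDOWN = 15    # approximate slowdown of traced code, from the text


def traced_clock_rate(untraced_hz):
    """Clock interrupt rate to configure on the traced system."""
    return untraced_hz / SLOWDOWN


def untraced_idle_instructions(traced_idle_instructions):
    """Rough estimate of idle-loop instructions in the untraced system."""
    return traced_idle_instructions * SLOWDOWN
```

For instance, a system whose clock normally interrupts 60 times per second would be configured for 4 interrupts per second while traced, and 1000 idle-loop instructions counted in the traced system would be taken to stand for about 15000 in the untraced system.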

Scheduler policy is also affected by time dilation, but this is an issue we have chosen not to
address. Instead, we concentrate on workloads such as single-process workloads and client-server
systems. For these workloads, all context switches are determined by client-server relationships,
and scheduler policy is irrelevant. For accurate traces of timesharing workloads, it would be
necessary to scale I/O delays and adjust scheduler policy to replicate untraced behavior. It
should be possible to improve the behavior of the traced system, although perfect reproduction of
untraced behavior is not a practical goal. Given current trends toward single-user machines and
away from timesharing, the limitation to client-server systems leaves ample domain for our
research.

4.2. Page Mapping Policy 

The virtual-to-physical page map is determined by policy implemented in the operating system,
and can have significant impact on memory system behavior [7, 18]. An address trace obtained
through software methods contains virtual addresses, yet caches are often indexed by physical 
addresses. A trace-based simulation of such a physical cache requires some virtual-to-physical 
address translation. The most straightforward approach is to implement the desired page map- 
ping policy in the simulator. The traced Ultrix and Mach 3.0 kernels also provide the option of 
extracting the page-map from the running system. 
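
Implementing a page mapping policy in the simulator can be as simple as the following sketch. The first-touch policy shown here is a hypothetical example chosen for brevity, not a policy used by any of the traced systems, and the page size is an assumption.

```python
PAGE_SHIFT = 12    # 4 KB pages (assumption)
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class FirstTouchPageMap:
    """Hypothetical policy: assign physical frames in first-touch order."""

    def __init__(self):
        self.frames = {}    # (address space id, virtual page) -> frame number

    def translate(self, asid, vaddr):
        """Translate a virtual address to a physical address."""
        key = (asid, vaddr >> PAGE_SHIFT)
        # A page seen for the first time gets the next unused frame.
        frame = self.frames.setdefault(key, len(self.frames))
        return (frame << PAGE_SHIFT) | (vaddr & PAGE_MASK)
```

A simulator of a physically indexed cache would pass each traced virtual address through `translate` before indexing the cache; substituting a different policy changes only this class.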

4.3. Defensive Tracing 

When possible, the validity of tools was tested in isolation from the rest of the system. The 
system was further tuned and corrected by looking for anomalies in measured behavior. In this 
section we discuss redundancy and error modes built into the tracing system that are helpful for 
avoiding certain kinds of errors. In the next section we discuss our final means of evaluating the 
quality of trace, measuring the ability of a trace driven simulation to predict measurements of an 
uninstrumented system. 

The following discussion applies primarily to the traced Ultrix and Mach 3.0 systems. 

The correctness of trace generated by epoxie instrumentation was validated by comparing 
epoxie trace for deterministic user programs to trace from a CPU simulator. The verification of 
trace from epoxie against trace from an independently developed CPU simulator establishes with 
a high degree of certainty the correctness of epoxie instrumentation. 

In the operating system kernel, code rewritten by epoxie co-exists with hand-instrumented 
code, as well as the uninstrumented code that implements certain parts of the tracing system. 
Unlike user code, the kernel is significantly involved in controlling the state of the tracing
system. A number of approaches were used to ascertain that tracing the kernel did not introduce
errors or unexpected trace distortion. 

• The format of trace contains a significant degree of redundancy, such that missing 
words of trace or erroneous writes into the trace are detected with a very high prob- 
ability. Conditions checked include (i) that each instruction basic block address is 
valid for the address space in question, and (ii) that in each basic block the expected 
number of memory operations occurs. 

• A large number of sanity checks were used to verify that trace was not being misin- 
terpreted. For example, the simulator checks that all kernel instruction addresses are 
in the kernel instruction address space. 

• Reference counting tools were used to make a dynamic count of the number of times 
each instruction in the kernel was executed. In this way it was possible to identify 
anomalous system activity caused by errors in the tracing system. 
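
The two redundancy conditions listed above can be sketched as a small checker. The record
layout used here, a tuple of address-space id, basic block address, and the list of memory
references for that block, is a hypothetical stand-in for the real trace format, as are the
names check_trace and block_table.

```python
def check_trace(records, block_table):
    """Return a list of (index, message) for records that fail the checks.

    records: iterable of (asid, block_addr, mem_refs) tuples (hypothetical format).
    block_table: {asid: {block_addr: expected_mem_op_count}}, assumed to be
    built from the instrumented binaries.
    """
    errors = []
    for i, (asid, block_addr, mem_refs) in enumerate(records):
        blocks = block_table.get(asid, {})
        if block_addr not in blocks:
            # Condition (i): the block address must be valid for this address space.
            errors.append((i, "invalid basic block address"))
        elif len(mem_refs) != blocks[block_addr]:
            # Condition (ii): the block must carry the expected number of memory ops.
            errors.append((i, "unexpected number of memory operations"))
    return errors


block_table = {1: {0x400000: 2, 0x400020: 0}}
trace = [
    (1, 0x400000, [0x7FFF0000, 0x10000000]),  # consistent record
    (1, 0x400020, [0x10000004]),              # one stray word of trace
    (1, 0x400044, []),                        # address is not a block start
]
problems = check_trace(trace, block_table)
```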

Each time the tracing system changes from trace-generation mode to trace-analysis mode, a 
certain amount of "dirt" is introduced into the trace, where the activity reflected in the trace 
does not accurately reflect a run of an uninstrumented system. No user-level trace is lost, but it 
is possible for some amount of errant system activity to be measured or ignored. As an example, 
an I/O request might be made during trace-generation mode, but complete during trace-analysis 
mode. The trace from the completion of the I/O request would then be lost. The approach taken 
to minimize the inaccuracies introduced by these transitions was to be sure they are rare, by 
making the in-kernel trace buffer large. The current system uses a 64 megabyte buffer. A buffer 
of this size permits approximately 32 million instructions of continuous execution between trace 
analysis phases. For an untraced system, this corresponds to about two seconds of continuous
execution.
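
The buffer figures quoted above imply an average trace density and an untraced execution
rate. The short calculation below derives both; it assumes that a megabyte means 2**20 bytes
and treats the approximate figures in the text as exact, so the results are rough estimates
rather than measured values.

```python
BUFFER_BYTES = 64 * 2**20          # 64 megabyte in-kernel trace buffer (2**20 assumed)
INSTRUCTIONS_PER_FILL = 32e6       # approximate figure quoted in the text
UNTRACED_SECONDS_PER_FILL = 2.0    # approximate figure quoted in the text

# Average bytes of trace generated per instruction executed.
bytes_per_instruction = BUFFER_BYTES / INSTRUCTIONS_PER_FILL

# Implied execution rate of the untraced system, in millions of instructions per second.
untraced_mips = INSTRUCTIONS_PER_FILL / UNTRACED_SECONDS_PER_FILL / 1e6

print(f"{bytes_per_instruction:.1f} bytes of trace per instruction")
print(f"{untraced_mips:.0f} million instructions per second untraced")
```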

4.4. Truths Revealed 

In debugging the tracing system, a procedure used repeatedly was to identify anomalous be- 
havior indicated by simulator output and trace it back to a bug in the simulator or simulation 
model. Eventually, these investigations stopped revealing problems in the experimental system, 
and began to expose unexpected behavior from the actual hardware and operating system im- 
plementation. Some of these behaviors include: 

• A bug in the instruction cache flushing routine caused an excessive number of un- 
cached instruction references in Mach 3.0. 

• System activity accounts for about 1% of execution time in tomcatv, but system 
policy in the virtual-to-physical page selection can cause execution time to vary by 
over 10%. 

• Conservative write policies in Ultrix induce greatly increased I/O delays.

Overall, the ability of the tracing system to reveal these unexpected behaviors demonstrates that 
the combined tracing/simulation system accurately reflects the behavior of the uninstrumented 
system. 









5. Validation of Methods 

This section describes how we measured the fidelity of the experimental system to real system 
behavior. The measurements use the example workloads described in Table 1. Overall,
these measurements demonstrate that the tracing/simulation system is a good model for real sys- 
tem behavior. 



Workload    Description
--------    -----------
sed         The UNIX stream editor run three times over the same 17K input file.
egrep       The UNIX pattern search program run three times over a 27K input file.
yacc        The LR(1) parser-generator run on an 11K grammar.
gcc         The GNU C compiler (gcc) translating a 17K (preprocessed) source file into
            optimized Sun-3 assembly code.
compress    Data compression using Lempel-Ziv encoding. A 100K file is compressed then
            uncompressed.
espresso    A program that minimizes boolean functions run on a 30K input file.
lisp        The 8-queens problem solved in LISP.
eqntott     A program that converts boolean equations to truth tables using a 1390 byte
            input file.
fpppp       A program that does quantum chemistry analysis. This program is written in
            Fortran.
doduc       Monte-Carlo simulation of the time evolution of a nuclear reactor component
            described by 8K input file. This program is written in Fortran.
liv         The Livermore Loops benchmark.
tomcatv     A program that generates a vectorized mesh. This program is written in Fortran.

Table 1: Experimental workloads with execution times for a DECstation 5000/200.

Except where indicated, all programs are written in C. The bottom four workloads are 
floating-point intensive. 



5.1. Program Execution Time 

We used a high resolution timer to measure execution times of the workloads. In Table 2, we 
compare the measured times with times predicted from a trace-driven simulation of the DECsta- 
tion 5000/200 memory system. 

The predicted times in Table 2 include contributions from four different sources:

• CPU cycles 

• memory system stalls 

• arithmetic stalls 

• I/O stalls 

Each instruction executed contributes one CPU cycle to the total execution time. Memory sys- 
tem stall cycles are calculated by multiplying counts of penalty events (cache read misses, un- 
cached reads, and write-buffer stalls) by the number of stall cycles per event. Pixie [27] was 
used to estimate arithmetic stalls, as the tracing system does not measure these events. 
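
The way the four sources combine can be sketched as a short function. This is an
illustration under stated assumptions, not the simulator's code: the function name, the event
names, the per-event stall cycles, the counts, and the 25 MHz clock in the example are all
placeholders chosen to make the arithmetic easy to follow.

```python
def predicted_seconds(instructions, penalty_counts, stall_cycles,
                      arithmetic_stall_cycles, io_stall_cycles, clock_hz):
    """Sum the four cycle sources and convert the total to seconds.

    penalty_counts and stall_cycles are dicts keyed by event name; every
    counted event needs a matching per-event penalty.
    """
    cpu_cycles = instructions  # one CPU cycle per instruction executed
    memory_stalls = sum(count * stall_cycles[event]
                        for event, count in penalty_counts.items())
    total = cpu_cycles + memory_stalls + arithmetic_stall_cycles + io_stall_cycles
    return total / clock_hz


# Placeholder inputs, not measurements from the paper.
seconds = predicted_seconds(
    instructions=20_000_000,
    penalty_counts={"cache_read_miss": 500_000,
                    "uncached_read": 10_000,
                    "write_buffer_stall": 200_000},
    stall_cycles={"cache_read_miss": 15,
                  "uncached_read": 10,
                  "write_buffer_stall": 5},
    arithmetic_stall_cycles=400_000,   # would come from a tool such as pixie
    io_stall_cycles=1_000_000,         # would come from scaled idle-loop counts
    clock_hz=25_000_000,               # assumed clock rate
)
```

With these placeholder inputs the total is 30 million cycles, or 1.2 seconds at the assumed
clock rate.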

                Mach 3.0                Ultrix
workload   measured  predicted   measured  predicted
--------   --------  ---------   --------  ---------
sed            0.58       0.48       0.48       0.54
egrep          2.05       2.02       1.94       1.90
yacc           1.70       1.68       1.80       1.79
gcc            2.26       3.21       4.10       4.16
compress       1.38       1.17       1.26       1.11
espresso       6.03       6.21       6.43       6.40
lisp           62.0       56.6       53.3       53.6
eqntott        66.1       65.7       65.6       65.8
fpppp          15.9       16.7       15.9       15.7
doduc          20.7       21.4       21.7       21.2
liv            1.29       1.29       1.17       1.26
tomcatv       139.2      137.0      155.4      153.5

Table 2: Run Times, measured and predicted, in seconds.

The predicted times are the sum of machine cycles from four different sources: instruction
execution, memory system stalls, I/O time, and arithmetic stalls. The first three values are
measured by the tracing system. Estimates for arithmetic stalls are as measured by pixie [27].
Execution times in this table are for runs with revision 3.0 of the floating point unit. Also, the
buffer cache was warmed with executable text before program execution.

The estimate of I/O stalls is derived from a count of idle-loop instruction references made
from the memory reference trace. Information on idle-loop activity from the trace must be
adjusted to compensate for the effects of time dilation on the execution of the idle loop. As an
example, consider a workload that executes 15 million instructions in the idle loop while waiting
for completion of synchronous disk I/O. This corresponds to some amount of real time required
for I/O operations by the disk. Address tracing does not change the latency of disk operations,
but time dilation changes the execution rate of the idle loop. Suppose that instrumented code is
slower than uninstrumented code by a factor of fifteen. Then only 1/15th as many, or 1 million,
idle-loop instructions will be recorded in the trace. Idle-loop instruction counts from the trace
must be scaled to compensate for the slowdown in idle-loop execution. For the predictions of
program execution time from trace data, 15 is used as an estimate of the effect of instrumentation
on the idle loop.
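
The idle-loop correction amounts to a single multiplication, shown below with the numbers
from the example in the text. The function and constant names are hypothetical.

```python
DILATION_FACTOR = 15  # estimate used in the text for the idle-loop slowdown


def scale_idle_instructions(traced_idle_instructions, dilation=DILATION_FACTOR):
    """Estimate the idle-loop instructions an untraced system would execute."""
    return traced_idle_instructions * dilation


# 1 million idle-loop instructions recorded in the trace correspond to
# 15 million idle-loop instructions on the untraced system.
untraced_idle = scale_idle_instructions(1_000_000)
```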

The running times from simulator data are rough estimates, and are subject to error from a 
number of sources. 

• Disk Latency and Idle time. The simulator's model of disk delays is only an
approximation of real behavior. This approximation introduces distortions to time
estimates in two ways. First, some system activity is missed when tracing is
interrupted during a disk request. Second, tracing changes the behavior of disk
read-ahead. Some read-ahead requests which complete in the traced system don't
complete in the standard system. This results in idle time for the standard system that
does not occur in the traced system.

• Lack of pipeline model. The simulation system does not model the CPU pipeline. 
Although there is a mechanism to model the correct sequencing of instruction reads 
and data reads and writes, two other behaviors are not modeled: 

• Floating point latency can overlap with write buffer cycles and cache misses 
in the DECstation 5000/200. This overlapping is not modeled in the 
simulator. 

• The simulator does not account for cycles required to enter and exit exception 
handlers. 
• Page mapping policy. Cache performance can vary significantly depending on the
virtual to physical page mapping in use. This affects the repeatability of workload 
behavior, particularly for the random page mapping policy used in Mach 3.0. 

• Clock interrupt frequency on both systems was scaled by a factor of fifteen to com- 
pensate for time dilation. This is a coarse approximation. 




Figure 3: Error in predicted execution times for Ultrix. 

The relatively large prediction errors for sed, compress, and liv are explained by inaccuracies 
in the simulation model used for prediction. See the text for a complete discussion. 

Figure 3 shows percent error for predictions of execution time for twelve workloads running
under Ultrix***. Predictions for most of the workloads are quite good. Three of the workloads
have errors greater than five percent. The explanations for these errors give interesting insight
into the behavior of the tracing system:

• Sed has the shortest execution time of all the workloads, under 0.5 seconds for three
runs. The 12% error corresponds to 0.06 seconds. Such a short execution time
exaggerates the distortion introduced by disk latency approximations.

• Compress has the largest input file of all the workloads, 100K bytes, but its
execution time is only 1.32 seconds. The prediction error is mostly due to disk read-ahead
phenomena, where read-ahead requests to disk complete in the traced system but
induce idle time in the untraced system. A comparison of idle time predicted by the
trace/simulation system and idle time measured by the timing facility in the c-shell
[31] confirms that the simulator does under-estimate idle activity.

• Liv has the worst write-buffer behavior of all the workloads, and also has significant
floating point activity. The prediction error is caused by the overlapping of write
buffer and floating point activity that is not modeled in the simulator.

*** Because of the large variability of running time induced by the Mach 3.0 page mapping
policy, we do not present error figures for Mach 3.0.
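
The percent errors discussed above can be recomputed from the Ultrix columns of Table 2. The
sketch below assumes that percent error is defined as (predicted - measured) / measured; the
sign convention used in Figure 3 is not stated in the text.

```python
# Ultrix run times in seconds from Table 2: (measured, predicted)
ULTRIX_TIMES = {
    "sed": (0.48, 0.54), "egrep": (1.94, 1.90), "yacc": (1.80, 1.79),
    "gcc": (4.10, 4.16), "compress": (1.26, 1.11), "espresso": (6.43, 6.40),
    "lisp": (53.3, 53.6), "eqntott": (65.6, 65.8), "fpppp": (15.9, 15.7),
    "doduc": (21.7, 21.2), "liv": (1.17, 1.26), "tomcatv": (155.4, 153.5),
}


def percent_error(measured, predicted):
    """Signed prediction error as a percentage of the measured time (assumed definition)."""
    return 100.0 * (predicted - measured) / measured


errors = {name: percent_error(m, p) for name, (m, p) in ULTRIX_TIMES.items()}

# Workloads whose prediction error exceeds five percent in magnitude.
large = sorted(name for name, err in errors.items() if abs(err) > 5.0)
```

Under this definition the workloads above five percent are compress, liv, and sed, matching
the three discussed in the text.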

Considering the known sources of error, the estimated execution times correlate well with 
measurements of execution time made with an accurate timer. Estimates of idle time are one of 
the dominant sources of error. As idle time has a negligible effect on cache performance, this 
source of error in execution-time predictions does not cause a significant distortion for simula- 
tions of memory system behavior for the restricted class of workloads we consider. Similarly, 
the simulation does not model the overlap of floating point delays with memory delays, but this 
has no impact on cache activity, as floating point delay has negligible impact on the pattern of 
cache misses that occur. Page mapping is another source of error, and the random policy used by 
Mach 3.0 causes much greater variation in execution times, with a subsequent loss of precision 
in time predictions. The good estimates of running time for most of the workloads demonstrate
that the address trace collection is accurate, with errors in predicted execution time due primarily 
to inaccuracies in the modeling of the system rather than error inherent in the trace. 

5.2. User TLB Miss Count 

Using a kernel with a user TLB miss counter, we compared the TLB miss counts predicted by
the simulator to TLB miss counts from an uninstrumented system (see Table 3).



                Mach 3.0                Ultrix
workload   predicted  measured   predicted  measured
--------   ---------  --------   ---------  --------
sed             7493      6438         131       190
egrep           6430      6122         164       191
yacc            9270      7494         270       318
gcc            53389     48355       29057     29948
compress       91706     89966       79682     79692
espresso       10351      7252         838      1006
lisp           28605     37919         110       179
eqntott       717428    706915      675166    674579
fpppp          22816     21893        3256      1894
doduc          48859     39129        6023      3510
liv             2753      2423          70        63
tomcatv       340968    359976      317872    314950



Table 3: TLB misses, measured and predicted. 

One source of error in the TLB miss predictions is explicit TLB writes from the kernel. The 
kernel sometimes avoids a user TLB miss by writing the TLB explicitly, using tlbdropin()
in Ultrix or tlb_map_random() in Mach. In the simulator, which does not know about these
writes, all TLB fills are caused by TLB misses. Kernel instruction reference counts for gcc
showed about 1800 calls to tlbdropin() for Ultrix, and 3700 calls to tlb_map_random()
for Mach. Also observe that the TLB uses a random replacement policy. The miss rates 
predicted by the simulator demonstrate a certain amount of error. Given the type of activity and 
its small impact on overall performance, this error does not detract significantly from the quality 
of overall measurements. These measurements demonstrate another end-to-end method that was 
used to evaluate and improve the correlation between simulated and real behavior. 
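
The effect of explicit TLB writes on miss counts can be illustrated with a toy model. The
sketch below is a simplified, fully associative TLB with random replacement; the 64-entry
size, the event format, and the names RandomTLB and count_misses are assumptions for this
example and do not describe the simulator used in this work.

```python
import random


class RandomTLB:
    """Fully associative TLB with random replacement (simplified model)."""

    def __init__(self, entries, seed=0):
        self.entries = entries
        self.slots = []                 # resident virtual page numbers
        self.rng = random.Random(seed)
        self.misses = 0

    def _fill(self, vpn):
        if vpn in self.slots:
            return
        if len(self.slots) < self.entries:
            self.slots.append(vpn)
        else:
            # Evict a randomly chosen entry.
            self.slots[self.rng.randrange(self.entries)] = vpn

    def reference(self, vpn):
        """A user reference: counts a miss if the page is not resident."""
        if vpn not in self.slots:
            self.misses += 1
            self._fill(vpn)

    def drop_in(self, vpn):
        """An explicit kernel write: fills the TLB without counting a miss."""
        self._fill(vpn)


def count_misses(events, entries=64):
    tlb = RandomTLB(entries)
    for kind, vpn in events:
        if kind == "ref":
            tlb.reference(vpn)
        else:  # "dropin"
            tlb.drop_in(vpn)
    return tlb.misses


events = [("dropin", 7), ("ref", 7), ("ref", 8), ("ref", 7)]
real = count_misses(events)                                      # kernel writes visible
simulated = count_misses([e for e in events if e[0] == "ref"])   # kernel writes unknown
```

In this small example the model that sees the explicit write counts one miss, while the
model that does not see it counts two. Explicit writes are only one source of error, and
with random replacement two runs over the same references need not produce identical counts
once the TLB is full.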
6. Conclusions

Our experience demonstrates that software methods can be applied to collect full address
traces, with both system and user references. We have demonstrated instrumentation tools, trace
formats and techniques that help ensure trace quality, and measurements to establish that address
traces reflect true system behavior. Traces from all three systems have already been applied to
numerous problems in memory system and software design research [5, 7, 8, 9, 18].